Covalent bond symmetry breaking and protein secondary structure 
Both symmetry and the organized breaking of symmetry play a pivotal role in our understanding of 
structure and pattern formation in physical systems, including the origin of mass in the Universe 
and the chiral structure of biological macromolecules. Here we report on a new symmetry breaking 
phenomenon that takes place in all biologically active proteins; this symmetry breaking thus relates 
to the inception of life. The unbroken symmetry determines the covalent bond geometry of an sp3 
hybridized carbon atom. It dictates the tetrahedral architecture of atoms around the central carbon 
of an amino acid. Here we show that in a biologically active protein this symmetry becomes broken. 
Moreover, we show that the pattern of symmetry breaking is in direct correspondence with the 
local secondary structure of the folded protein. 



Protein modeling is based on various well tested and broadly accepted stereochemical paradigms [1], [2]. These 
paradigms are instrumental in protein structure prediction [3], and underlie the phenomenological force fields that 
describe protein dynamics [4]. The enormous success in resolving the more than 70,000 structures that are presently in the 
Protein Data Bank (PDB) [5] is a clear manifestation that the various paradigms are valid to a high precision. Among the 
important paradigms is the assumption that the backbone Cα carbons are in a definite sp3 hybridized state, with 
its distinct tetrahedral geometry. For example, the backbone τ_NC = (N-Cα-C) bond angle should always fluctuate 
around a definite and computable value that depends only on the covalent bonds between the Cα and its adjacent 
Cβ, N, C and H atoms. In particular, this value should not depend on the secondary structure environment. 

With the arrival of third-generation synchrotron X-ray sources and the ensuing rapid increase in the number of 
protein structures that are being resolved with ultrahigh sub-Ångström resolution, it is now possible to experimentally 
scrutinize the validity of these paradigms. In particular, any systematic, secondary structure dependent breaking 
of the covalent tetrahedral symmetry around the Cα carbons could help us to better understand why proteins fold 
and to predict more accurately how they fold. This could also have major repercussions for pharmaceutical drug 
development, and could help us to better understand what life is. 

A molecular dynamics force field explicitly assumes that the tetrahedral symmetry remains unbroken. But ab 
initio quantum mechanical calculations [6] and empirical studies [7]-[9] have already pointed out that tetrahedral 
bond angles around an sp3 hybridized carbon may be subject to measurable fluctuations. For example, there is an 
estimate that the τ_NC = (N-Cα-C) bond angle could fluctuate by as much as 8.8° [7] around its equilibrium position. 
This would have clearly measurable effects on the way proteins fold. But the potential existence of a systematic 
and secondary structure dependent tetrahedral symmetry breaking has until now not been scrutinized. 

Here we address the presence of a systematic tetrahedral symmetry breaking by investigating the secondary structure 
dependence in the values of τ_NC, and in the adjacent τ_Nβ = (N-Cα-Cβ) and τ_Cβ = (C-Cα-Cβ) bond angles. In order 
to diminish any bias towards paradigm based refinements we inspect several subsets of the Protein Data Bank (PDB). 
These include the canonical one that comprises all PDB configurations with resolution 2.0 Å or better, and its subsets 
with resolution better than 1.5 Å, and better than 1.0 Å. We also inspect a subset of the 2.0 Å set that contains only 
those proteins that have less than 30% sequence similarity, and finally we also consider those proteins that appear 
in the CATH classification. We find that our conclusions are independent of the data set we use, and for illustrative 
purposes we use the canonical 2.0 Å set. 
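
As a minimal sketch of how the three bond angles τ_NC, τ_Nβ and τ_Cβ can be evaluated from atomic coordinates, the following uses NumPy. The function names and the idealized test coordinates are illustrative assumptions, not part of the analysis pipeline used for the PDB subsets; for a perfectly regular tetrahedron every angle equals arccos(-1/3), approximately 109.47°.

```python
import numpy as np

def bond_angle(a, center, b):
    """Angle a-center-b in degrees, computed from Cartesian coordinates."""
    u = np.asarray(a, dtype=float) - np.asarray(center, dtype=float)
    v = np.asarray(b, dtype=float) - np.asarray(center, dtype=float)
    cos_angle = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against round-off pushing the cosine slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def tetrahedral_angles(n, ca, c, cb):
    """Return (tau_NC, tau_Nbeta, tau_Cbeta) around one C-alpha atom."""
    return (bond_angle(n, ca, c),
            bond_angle(n, ca, cb),
            bond_angle(c, ca, cb))

# Idealized (hypothetical) geometry: three vertices of a regular tetrahedron
# centered at the C-alpha position. Real coordinates would come from a PDB entry.
ca = np.zeros(3)
n = np.array([1.0, 1.0, 1.0])
c = np.array([1.0, -1.0, -1.0])
cb = np.array([-1.0, 1.0, -1.0])
tau_nc, tau_nb, tau_cb = tetrahedral_angles(n, ca, c, cb)
```

Collecting these three numbers per residue, grouped by secondary structure assignment, gives the kind of distributions discussed in connection with Figure 2.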

The conventional backbone Ramachandran torsion angles φ, ψ and ω relate to the backbone atoms N and C that 
we investigate. To diminish potential bias that may depend on refinement procedures, we here adopt the N and C 
independent, geometrically determined backbone Frenet frames; we follow the construction described in [10]. These 
frames depend only on the Cα carbon coordinates r_i, with i = 1, ..., n labeling the residues. We first 
introduce the unit backbone tangent (t) and binormal (b) vectors 

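$$ \mathbf{t}_i = \frac{\mathbf{r}_{i+1} - \mathbf{r}_i}{|\mathbf{r}_{i+1} - \mathbf{r}_i|} \quad \& \quad \mathbf{b}_i = \frac{\mathbf{t}_{i-1} \times \mathbf{t}_i}{|\mathbf{t}_{i-1} \times \mathbf{t}_i|} \tag{1} $$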

With the unit normal vector n_i = b_i × t_i we have the full orthonormal Frenet frame at the location of each Cα. The 
bond angles and torsion angles are 



$$ \kappa_i = \arccos(\mathbf{t}_{i+1} \cdot \mathbf{t}_i) \quad \& \quad \tau_i = \arccos(\mathbf{b}_{i+1} \cdot \mathbf{b}_i) \tag{2} $$
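
The construction in (1) and (2) can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the authors' code: the function names are hypothetical, array indices start at zero, and the torsion angle is the unsigned value in [0, π] that the arccos in (2) produces. Three consecutive collinear Cα atoms would give a vanishing cross product and are not handled here.

```python
import numpy as np

def frenet_frames(r):
    """Discrete Frenet frames from an (n, 3) array of C-alpha positions.

    Returns t with shape (n-1, 3), and b, n_vec with shape (n-2, 3).
    b[k] and n_vec[k] belong to the same site as t[k+1].
    """
    r = np.asarray(r, dtype=float)
    d = r[1:] - r[:-1]
    t = d / np.linalg.norm(d, axis=1, keepdims=True)        # unit tangents
    cross = np.cross(t[:-1], t[1:])                          # t_{i-1} x t_i
    b = cross / np.linalg.norm(cross, axis=1, keepdims=True) # unit binormals
    n_vec = np.cross(b, t[1:])                               # n_i = b_i x t_i
    return t, b, n_vec

def bond_and_torsion_angles(t, b):
    """Bond angles kappa and (unsigned) torsion angles tau, in radians."""
    kappa = np.arccos(np.clip(np.sum(t[1:] * t[:-1], axis=1), -1.0, 1.0))
    tau = np.arccos(np.clip(np.sum(b[1:] * b[:-1], axis=1), -1.0, 1.0))
    return kappa, tau

# Hypothetical test curve: points on a regular helix, so every site is equivalent.
r = np.array([[np.cos(i), np.sin(i), 0.5 * i] for i in range(8)])
t, b, n_vec = frenet_frames(r)
kappa, tau = bond_and_torsion_angles(t, b)
```

On the regular helix all entries of `kappa` coincide, and likewise for `tau`, which is a quick sanity check that the frames are built consistently along the chain.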
The Frenet framing describes the position of all atoms of the protein as they are seen by an 
imaginary observer who roller-coasts along the backbone from one Cα atom to the next, with gaze direction always fixed 
towards the next Cα [10]. In Figure 1 we display the statistical angular distribution of the backbone N and C and the 
side-chain Cβ atoms in our PDB data set, as they are seen by a Frenet frame observer who moves through all the proteins in our 
data set. The sphere is centered at the Cα, and its radius coincides with the length of the (approximately constant) 
covalent bond. We take the vector t that points towards the next Cα to be in the direction of the positive z-axis, so 
that with n in the direction of the positive x-axis we have a right-handed Cartesian coordinate system. We introduce the 
canonical spherical coordinates, so that the angle θ ∈ [0, π] measures latitude from the positive z-axis and the angle 
φ ∈ [0, 2π] measures longitude in a counterclockwise direction from the x-axis, i.e. from the direction of n towards 
that of b. 
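
The change of coordinates described above can be sketched as follows, assuming the frame vectors t, n and b at a given Cα are already available as unit vectors. The function name is hypothetical, and the longitude is mapped to [0, 2π) as in the text; a plotting convention with negative longitudes would simply skip the modulo.

```python
import numpy as np

def frenet_spherical(ca, atom, t, n_vec, b):
    """Spherical angles (theta, phi), in radians, of `atom` as seen from `ca`.

    The local axes are x = n_vec, y = b, z = t. theta is measured from +z,
    phi counterclockwise from x towards y.
    """
    v = np.asarray(atom, dtype=float) - np.asarray(ca, dtype=float)
    v = v / np.linalg.norm(v)                        # direction only
    theta = np.arccos(np.clip(np.dot(v, t), -1.0, 1.0))
    phi = np.arctan2(np.dot(v, b), np.dot(v, n_vec)) % (2.0 * np.pi)
    return float(theta), float(phi)

# Hypothetical frame aligned with the laboratory axes, for illustration.
n_vec = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
t = np.array([0.0, 0.0, 1.0])
theta, phi = frenet_spherical(np.zeros(3), np.array([0.0, 1.5, 0.0]), t, n_vec, b)
```

Applying this to every N, C and Cβ atom in a data set, and histogramming the resulting (θ, φ) pairs on the unit sphere, yields angular distributions of the kind shown in Figure 1.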




FIG. 1: The locations of a) backbone N atoms, b) backbone C atoms, c) side-chain Cβ atoms, as seen by a Frenet frame observer 
located at the Cα carbon at the center of the sphere. In a) the (smaller) point-like direction of backbone N atoms corresponds to 
the L-α Ramachandran region. The larger region forms a segment of the great circle φ ≈ -15°. Note that the loops interpolate 
latitudinally between α-helices and β-sheets. In b) the directions of backbone C form a segment of a small circle around the z-axis, 
with θ ≈ 20°. The N and C oscillations become coupled into the horseshoe shaped nutation of Cβ as shown in Figure 1 c). 



We find it remarkable that in the Frenet frame coordinate system, the N and C oscillations shown in Figures 1 (a) 
and 1 (b) become fully separated into the locally orthogonal θ and φ directions, respectively; this is not the case in 
a generic coordinate system. We also find it remarkable that secondary structures such as α-helices, β-sheets, loops 
and left-handed α-regions are all clearly identifiable. Figure 1 (c) then reveals how the N and C oscillations become 
coupled into a horseshoe shaped nutation of Cβ. This nutation is similarly determined entirely by the local secondary 
structure environment, in an equally systematic manner. 

In Figures 2 (a)-(f) we plot the tetrahedral bond angles τ_NC, τ_Cβ and τ_Nβ separately for the α-helices, 3/10-helices 
and β-strands. As in Figure 1, the loops interpolate continuously between these regular secondary structures. 
Figures 2 (a) and 2 (d) clearly reveal that the τ_NC angles depend on the secondary structure in a systematic manner. 
But we observe no similar effect in either τ_Cβ or τ_Nβ. (The isolated small peak in Figures 2 (b) and 2 (e) is due to 
prolines.) 

The fact that only τ_NC in Figure 2 displays a systematic secondary structure dependence shows clearly 
that the paradigmatic tetrahedral symmetry of the sp3 hybridized Cα carbon atomic orbitals is broken. In a folded protein the 
covalent tetrahedral structure around Cα is not unique. Instead, the backbone secondary structure breaks the ground 
state tetrahedral symmetry in a systematic manner which is fully determined by the local secondary structure. We 
note that for the loop regions, the tetrahedron geometry interpolates deterministically between those of the adjacent 
regular secondary structures. There is a one-to-one correspondence between the shapes of the Cα tetrahedra and the 
way a protein folds. 

On the basis of the present PDB data we are unable to conclude whether the fact that the symmetry breaking is 
visible only in τ_NC reflects a true physical effect, or is simply a consequence of the existing refinement procedures 
that place all the tension on the NC bond angle. We propose that these details of the symmetry breaking could be 
investigated in the new generation ultra-high resolution X-ray experiments. 

Any molecular dynamics approach [4] to protein folding is based on a harmonic approximation of the energy around 
the equilibrium values of the bond angles, 



$$ E_{\mathrm{bond}} = \sum_{\mathrm{bonds}} K_\theta \, (\theta - \theta_0)^2 \tag{3} $$
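
To make the role of the equilibrium value θ_0 in (3) concrete, here is a small sketch of the harmonic bond-angle energy. All numbers (the angles, the force constant and its units) are hypothetical and chosen only for illustration; force fields also differ in whether a factor of 1/2 is absorbed into K_θ.

```python
import numpy as np

def harmonic_angle_energy(theta_deg, theta0_deg, k_theta):
    """Sum of K_theta * (theta - theta_0)^2 over bond angles.

    Angles are given in degrees and converted to radians, so k_theta is
    an energy per squared radian. theta0_deg may be a scalar or per-angle.
    """
    theta = np.radians(np.asarray(theta_deg, dtype=float))
    theta0 = np.radians(np.asarray(theta0_deg, dtype=float))
    return float(np.sum(np.asarray(k_theta, dtype=float) * (theta - theta0) ** 2))

# Hypothetical N-C_alpha-C angles from three different structural environments.
angles = [109.0, 111.5, 113.0]

# One shared equilibrium value, as assumed when the symmetry is unbroken.
e_single = harmonic_angle_energy(angles, 111.0, 80.0)

# Environment-dependent equilibrium values remove the artificial strain.
e_split = harmonic_angle_energy(angles, [109.0, 111.5, 113.0], 80.0)
```

With a single θ_0 the structure-dependent shifts of the angle are counted as strain energy, whereas environment-dependent equilibrium values would remove that contribution; this is the issue the remainder of the text addresses.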
FIG. 2: The probability density angular distribution of the τ_NC, τ_Nβ and τ_Cβ angles (in degrees) separately for β-strands (blue) 
and α-helices (red) in (a)-(c), and separately for 3/10-helices (yellow) and α-helices (again red) in (d)-(f). The secondary peak 
in (b) and (e) is proline. 

Here the equilibrium values θ_0 are determined using the paradigm that the sp3 symmetry of the amino acid remains 
unbroken and should have no dependence on the eventual secondary structure environment. However, Figures 1 
and 2 show unequivocally that in actual proteins these equilibrium values depend on the secondary structure in a 
systematic manner. We now proceed to investigate how to develop an energy function that describes this broken 
symmetry. Our starting point is the backbone energy function of [12]. There it has been shown how the collapsed 
PDB proteins can be described with experimental B-factor accuracy in terms of soliton solutions to a generalized 
discrete nonlinear Schrödinger (DNLS) equation. Indeed, the soliton solutions of the DNLS equation share a remarkable 
history with protein research [13]: they were first introduced by Davydov to describe the propagation of energy along 
α-helices [14]. Mathematically the DNLS equation is integrable; there is an infinite hierarchy of conserved quantities 
[15]. Explicitly, the backbone energy function is [12] 

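$$ E = - \sum_{i=1}^{N-1} 2\,\kappa_{i+1}\kappa_i + \sum_{i=1}^{N} \left\{ 2\kappa_i^2 + q\,(\kappa_i^2 - m^2)^2 + \frac{d_\tau}{2}\,\kappa_i^2\tau_i^2 - b_\tau\,\kappa_i^2\tau_i - a_\tau\,\tau_i + \frac{c_\tau}{2}\,\tau_i^2 \right\} \tag{4} $$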

Here the first sum together with the first three terms in the second sum comprise the energy function of the conventional 
DNLS equation, when expressed in the standard Hasimoto variables of fluid mechanics; see [10]-[12] for full details. 
The fourth (b_τ) and fifth (a_τ) terms are the only two lower order nontrivial conserved quantities that appear in the 
integrable DNLS hierarchy prior to the energy. These are the momentum and the helicity, respectively. The last (c_τ) 
term is the standard Proca mass term that we add for completeness; it could be ignored with only a minor effect on 
accuracy. Note in particular that any term odd in the κ_i is excluded by a global Z_2 symmetry [10]. We also note 
that the next, higher order conserved quantity in the DNLS hierarchy is the energy function of the modified KdV 
equation [15]. It could be added, but there is no point since with the present energy function we already reach the 
experimental B-factor accuracy. 
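
As a hedged sketch, the energy function (4) can be evaluated numerically as below. The sketch assumes that the second sum contains, in order, the terms 2κ_i², q(κ_i² - m²)², (d_τ/2)κ_i²τ_i², -b_τκ_i²τ_i, -a_τ τ_i and (c_τ/2)τ_i²; the parameter names `q`, `m`, `d_tau`, `b_tau`, `a_tau`, `c_tau` and the function name are our labels for these coefficients, and the numerical values in the example are hypothetical.

```python
import numpy as np

def backbone_energy(kappa, tau, q, m, d_tau, b_tau, a_tau, c_tau):
    """Evaluate the assumed form of the backbone energy for angle profiles.

    kappa, tau: arrays of equal length N (bond and torsion angles, radians).
    """
    kappa = np.asarray(kappa, dtype=float)
    tau = np.asarray(tau, dtype=float)
    # Nearest-neighbor coupling: -sum_{i=1}^{N-1} 2 kappa_{i+1} kappa_i
    coupling = -2.0 * np.sum(kappa[1:] * kappa[:-1])
    # On-site terms: DNLS part, momentum (b), helicity (a) and Proca mass (c).
    onsite = (2.0 * kappa**2
              + q * (kappa**2 - m**2) ** 2
              + 0.5 * d_tau * kappa**2 * tau**2
              - b_tau * kappa**2 * tau
              - a_tau * tau
              + 0.5 * c_tau * tau**2)
    return float(coupling + np.sum(onsite))

# Hypothetical uniform profile kappa_i = m with vanishing torsion.
kappa = np.full(5, 1.5)
tau = np.zeros(5)
params = dict(q=3.0, m=1.5, d_tau=1e-3, b_tau=1e-4, a_tau=1e-5, c_tau=1.0)
e_uniform = backbone_energy(kappa, tau, **params)
```

Because κ_i enters only through κ_i² and the product κ_{i+1}κ_i, flipping the sign of every κ_i leaves the value unchanged, in line with the global Z_2 symmetry mentioned above.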

The remarkable property of (4) is that the torsion angle appears at most quadratically, so that it can be eliminated 
explicitly by using the equations of motion. The values of the torsion angle, and consequently the entire Cα backbone 
geometry, then become fully determined by bond angle soliton solutions of a generalized DNLS equation [12]. 
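
The elimination can be written out in one line. The following is a sketch that assumes the torsion-dependent part of (4) consists of the four terms with coefficients d_τ, b_τ, a_τ and c_τ (quadratic coupling, momentum, helicity and Proca mass, in that order); the coefficient names are our labels. Setting the variation with respect to τ_i to zero gives a linear equation for τ_i:

```latex
\frac{\partial E}{\partial \tau_i}
  = d_\tau\,\kappa_i^2\,\tau_i - b_\tau\,\kappa_i^2 - a_\tau + c_\tau\,\tau_i = 0
\quad\Longrightarrow\quad
\tau_i[\kappa_i] = \frac{a_\tau + b_\tau\,\kappa_i^2}{c_\tau + d_\tau\,\kappa_i^2}
```

Substituting τ_i[κ_i] back into the energy leaves a function of the bond angles κ_i alone, whose extrema are then described by the generalized DNLS equation.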

Figures 1 (a) and (b) reveal that the N and C atoms oscillate independently, in the latitudinal and longitudinal 
directions respectively. Consequently the ensuing contributions to the protein energy function should also be 
independent and depend only on the respective angular variables. Together these two independent contributions should 
then combine into the nutation of Figure 1 (c). 

Combining standard universality arguments with the spirit of the harmonic approximation (3), we propose that the 
latitudinal and longitudinal contributions to the protein energy involve only the two lower order conserved quantities in 
the integrable DNLS hierarchy and the Proca mass. This fixes the ensuing contributions uniquely, 



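$$ E_\theta = \sum_{i=1}^{N} \left\{ \frac{d_\theta}{2}\,\kappa_i^2\theta_i^2 - b_\theta\,\kappa_i^2\theta_i - a_\theta\,\theta_i + \frac{c_\theta}{2}\,\theta_i^2 \right\} \tag{5} $$

$$ E_\varphi = \sum_{i=1}^{N} \left\{ \frac{d_\varphi}{2}\,\kappa_i^2\varphi_i^2 - b_\varphi\,\kappa_i^2\varphi_i - a_\varphi\,\varphi_i + \frac{c_\varphi}{2}\,\varphi_i^2 \right\} \tag{6} $$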

Accordingly the spherical angles (θ_i, φ_i) are fully determined by the profile of the backbone bond angles κ_i and the 
global parameters that are specific only to a given super-secondary structure. In particular, the tetrahedral symmetry 
breaking becomes driven by the degenerate ground state structure of the DNLS equation. (We note that the θ_i and 
φ_i variables are coupled to each other only indirectly, by the bond angles κ_i. In particular, the long range interactions 
that are necessary for describing a collapsed protein are entirely due to the non-local character of the DNLS solitons.) 

Since (5), (6) involve both latitudinal and longitudinal angles, the classical solutions of (4)-(6) can be utilized to 
describe both the backbone Cα and the side-chain Cβ atoms in a folded protein. As an example we inspect the chicken 
villin headpiece subdomain HP35 with PDB code 1YRF. This is a naturally existing 35-residue protein with three 
α-helices separated from each other by two loops. The villin continues to be subject to very extensive studies both 
experimentally [16]-[18] and in silico [19]-[22], and [22] reports on a molecular dynamics construction of folded villin 
with overall backbone RMSD accuracy around one Ångström. Since the force fields in [19]-[22] utilize the paradigm 
concept of unbroken Cα tetrahedral symmetry, the accuracy of [22] in particular can be adopted as a good measure 
of the symmetry breaking effect. 

We first solve the classical equations of motion for κ_i and τ_i from (4), and then construct the remaining angular 
variables from (5), (6) in terms of κ_i. We use the iterative algorithm and procedure that has been described in 
[23], [12]. Using the parameters in Table 1 we reach an overall RMSD accuracy of 0.39 Å for the combined Cα-Cβ 
configuration, which is in line with the experimental B-factor accuracy; see Figure 3, which displays our solution in 
comparison with the PDB configuration. 
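
For readers who wish to reproduce this kind of comparison, a root-mean-square distance between a model and a reference set of atomic positions can be sketched as follows. The text does not specify how the two configurations are aligned; the sketch assumes an optimal rigid superposition computed with the Kabsch algorithm, and the function name and test data are hypothetical.

```python
import numpy as np

def kabsch_rmsd(p, q):
    """RMSD between (n, 3) point sets p and q after optimal rigid superposition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Remove the translation by centering both sets on their centroids.
    p = p - p.mean(axis=0)
    q = q - q.mean(axis=0)
    # Optimal rotation from the singular value decomposition of the covariance.
    h = p.T @ q
    u, _, vt = np.linalg.svd(h)
    # Correct for a possible reflection so that the result is a proper rotation.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = (rot @ p.T).T - q
    return float(np.sqrt(np.mean(np.sum(diff**2, axis=1))))

# Hypothetical check: a rotated and translated copy should give zero RMSD.
rng = np.random.default_rng(0)
model = rng.normal(size=(10, 3))
angle = 0.7
rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0,            0.0,           1.0]])
reference = model @ rz.T + np.array([1.0, 2.0, 3.0])
rmsd = kabsch_rmsd(model, reference)
```

For a combined Cα-Cβ comparison, the two point sets would simply contain both atom types for every residue, in matching order.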

Symmetry breaking is a fundamentally important physical phenomenon, often intimately related to structure 
formation. Here we have shown that a protein backbone breaks the tetrahedral symmetry of the sp3 hybridized Cα 
covalent bonds, in a manner that is entirely determined by the local secondary structure. We have also presented 
a simple energy function that accounts for the symmetry breaking, and used it to compute the Cβ nutation trajectories of the 
HP35 villin with experimental B-factor accuracy. Our observation is based on the available high precision PDB data; 
consequently a detailed analysis of our symmetry breaking is experimentally feasible. Our observations should have 
wide applicability in the development of future refinement tools, and for constructing accurate theoretical and 
computational methods for investigating protein folding dynamics and structure. Indeed, the direct relation between the 
symmetry breaking and the protein fold geometry suggests that our broken symmetry is intimately connected to the 
underpinnings of protein folding and thus with the origin of life. 
FIG. 3: A cartoon comparison of HP35 with our soliton solution summarized in Table 1. The combined Cα and Cβ root-mean-square 
distance is 0.39 Å, which is in line with the experimental B-factor accuracy. 
TABLE I: Parameter values for the two-soliton solution that describes the two loops of 1YRF with 0.39 Å accuracy for both Cα 
and Cβ atoms. The displayed RMSD values are for the individual solitons. The soliton-1 is located at Glu-45 - Phe-58 and the 
soliton-2 is located at Phe-58 - Lys-73. We utilize scale invariance to set all c_θ = c_φ = 1. Notice that the result is sensitive 
to the accuracy of the parameters. This is because a folded protein is a piecewise linear polygonal chain with a positive Liapunov 
exponent. 



